-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Musicbrainz reduce memory used by processing by chunks #165
Conversation
This reverts commit e79de48.
…DMAL/linkedmusic-datalake into musicbrainz-reduce-memory-used
Can you explain in the PR description how this solves the issue? Or this commit (fix: delete buggy code) would be a great place to have a message about what the buggy code was! |
You could also try using a more efficient JSON library like ujson or orjson. The built-in json library is well known to be fine for quick tasks but mot great for speed or memory use. |
…DMAL/linkedmusic-datalake into musicbrainz-reduce-memory-used
musicbrainz/csv/convert_to_csv.py
Outdated
|
||
CHUNK_SIZE = 4096 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this go at the top with the other configuration constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add a comment on why you chose 4096?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add a comment on why you chose 4096?
I don't know exactly, I think this could be any number not too large, but GPT and stack overflow all use something like 4096 and 8192, so I decided to use 4096.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add a comment on why you chose 4096?
I don't know exactly, I think this could be any number not too large, but GPT and stack overflow all use something like 4096 and 8192, so I decided to use 4096.
So, comment that in the code: "4096 was chosen because ChatGPT and StackOverflow examples typically use 4096 or 8192."
In general, always comment (justify) why you chose a value or a method as opposed some other option.
change the type of recording from Q49017950 to Q482994
@candlecao You should commit 805c464 to a different branch... |
musicbrainz/csv/convert_to_csv.py
Outdated
@@ -195,7 +195,7 @@ def convert_dict_to_csv(dictionary_list: list) -> None: | |||
with open( | |||
"temp.csv", mode="a", newline="", encoding="utf-8" | |||
) as csv_records: | |||
writer_records = csv.writer(csv_records) | |||
writer_records = csv.writer(csv_records, delimiter="\t") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer comma-separated now... it's a different format.
…DMAL/linkedmusic-datalake into musicbrainz-reduce-memory-used
I'm not sure what the status of this is. Is it ready to review again? I'm confused by the number of unrelated commits! |
The bug encountered: I moved the code for reading json to other places, but forgot to delete the original part. It causes python to read the JSON twice.
After this is fixed, I read the file metadata and parse it into chunks of 4096 records. This solves the problem.